Using word n-grams to identify authors and idiolects

نویسندگان

چکیده

برای دانلود باید عضویت طلایی داشته باشید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Beyond Word N-Grams

We describe, analyze, and experimentally evaluate a new probabilistic model for wordsequence prediction in natural languages, based on prediction suffi~v trees (PSTs). By using efficient data structures, we extend the notion of PST to unbounded vocabularies. We also show how to use a Bayesian approach based on recursive priors over all possible PSTs to efficiently maintain tree mixtures. These ...

متن کامل

Using idiolects and sociolects to improve word prediction

In this paper the word prediction system Soothsayer1 is described. This system predicts what a user is going to write as he is keying it in. The main innovation of Soothsayer is that it not only uses idiolects, the language of one individual person, as its source of knowledge, but also sociolects, the language of the social circle around that person. We use Twitter for data collection and exper...

متن کامل

Variable word rate N-grams

The rate of occurrence of words is not uniform but varies from document to document. Despite this observation, parameters for conventional n-gram language models are usually derived using the assumption of a constant word rate. In this paper we investigate the use of variable word rate assumption, modelled by a Poisson distribution or a continuous mixture of Poissons. We present an approach to ...

متن کامل

Selecting and Weighting N-Grams to Identify 1100 Languages

This paper presents a language identification algorithm using cosine similarity against a filtered and weighted subset of the most frequent n-grams in training data with optional inter-string score smoothing, and its implementation in an open-source program. When applied to a collection of strings in 1100 languages containing at most 65 characters each, an average classification accuracy of ove...

متن کامل

Enhancing News Articles Clustering using Word N-Grams

In this work we explore the possible enhancement of the document clustering results, and in particular clustering of news articles from the web, when using word-based n-grams during the keyword extraction phase. We present and evaluate a weighting approach that combines clustering of news articles derived from the web using n-grams, extracted from the articles at an offline stage. We compared t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

ژورنال

عنوان ژورنال: International Journal of Corpus Linguistics

سال: 2017

ISSN: 1384-6655,1569-9811

DOI: 10.1075/ijcl.22.2.03wri